Nemo task5 6 hooks and progress reporting #9

Merged
TianyeGGBond merged 4 commits into rlops:nemo from yayajjiang:nemo-task5-6-hooks-and-progress-reporting on May 5, 2026
Conversation

@yayajjiang
Collaborator

No description provided.

yayajjiang and others added 3 commits April 19, 2026 09:58
… reporting

Task 5 (F11-flag, F5-hooks):
- Add RLixHooks Protocol with 7-method interface (before/after_generation,
  before/after_training, before_weight_sync, begin/end_progress_batch)
- Add NoOpRLixHooks default implementation for standalone NeMo RL runs
- Add NemoRLRLixHooks with GPU hook placeholders (Task 7 TODOs) and
  Task 3 NCCL placeholder in before_weight_sync
- Add grpo.py stub with DO_TIME_SHARING flag and 5 hook insertion points
  in the correct per-step order
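For orientation, a minimal sketch of the hook surface described above. The Protocol name, the default class, and the seven method names are from this commit; the signatures (no-arg methods returning None, a sample count on begin_progress_batch) are my assumptions:

```python
from typing import Protocol


class RLixHooks(Protocol):
    """The 7-method interface named in the commit; signatures assumed."""

    def before_generation(self) -> None: ...
    def after_generation(self) -> None: ...
    def before_training(self) -> None: ...
    def after_training(self) -> None: ...
    def before_weight_sync(self) -> None: ...
    def begin_progress_batch(self, total_samples: int) -> None: ...
    def end_progress_batch(self) -> None: ...


class NoOpRLixHooks:
    """Default implementation for standalone NeMo RL runs: every hook
    is a no-op, so the training loop runs unchanged."""

    def before_generation(self) -> None: pass
    def after_generation(self) -> None: pass
    def before_training(self) -> None: pass
    def after_training(self) -> None: pass
    def before_weight_sync(self) -> None: pass
    def begin_progress_batch(self, total_samples: int) -> None: pass
    def end_progress_batch(self) -> None: pass
```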

Task 6 (F9):
- Implement begin_progress_batch / end_progress_batch state machine
- 2% bucket granularity (50 buckets), deduplicated fire-and-forget emit
- _emit_progress() isolated as overridable method for testability
- _emit_progress body is TODO pending Task 7 scheduler actor wiring

Tests: 30 unit tests, all passing, no GPU/Ray/vLLM required
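A sketch of the Task 6 state machine above, assuming a per-sample completion callback; everything except begin/end_progress_batch, _emit_progress, and the 50-bucket / 2% numbers is invented for illustration:

```python
class ProgressState:
    NUM_BUCKETS = 50  # 2% granularity

    def begin_progress_batch(self, total_samples: int) -> None:
        self._total = max(total_samples, 1)
        self._last_bucket = -1  # nothing emitted yet this batch

    def on_sample_done(self, completed: int) -> None:
        # Hypothetical per-sample entry point: map the completion count
        # to a bucket, emit only when the bucket advances (deduplication).
        bucket = completed * self.NUM_BUCKETS // self._total
        if bucket > self._last_bucket:
            self._last_bucket = bucket
            self._emit_progress(bucket / self.NUM_BUCKETS)

    def end_progress_batch(self) -> None:
        if self._last_bucket < self.NUM_BUCKETS:
            self._emit_progress(1.0)  # guarantee a terminal 100% emit
        self._last_bucket = -1

    def _emit_progress(self, fraction: float) -> None:
        # Isolated so tests can override it; per the commit, the real
        # body is TODO until Task 7 wires in the scheduler actor, where
        # it becomes a fire-and-forget send.
        pass
```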

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tep_target_estimate bootstrap

- grpo.py: begin_progress_batch now called before before_generation each step
- TASK5_6_HOOKS.md: add chicken-and-egg problem explanation and two-mechanism solution
- test: add test_begin_progress_batch_called_before_before_generation (31 tests total)
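Putting the two commits together, the per-step order in the grpo.py stub would look roughly like this. DO_TIME_SHARING and the hook names are from the commits; the phase functions, and which calls count as the five insertion points, are assumptions:

```python
DO_TIME_SHARING = True  # flag from the Task 5 commit


def generate(batch):      # placeholder rollout phase
    return batch

def train(rollouts):      # placeholder training phase
    pass

def sync_weights():       # placeholder weight-sync phase
    pass


def grpo_step(hooks, batch):
    if DO_TIME_SHARING:
        # Per the follow-up commit, begin_progress_batch runs before
        # before_generation each step (part of the step_target_estimate
        # bootstrap).
        hooks.begin_progress_batch(len(batch))
        hooks.before_generation()
    rollouts = generate(batch)
    if DO_TIME_SHARING:
        hooks.after_generation()
        hooks.before_training()
    train(rollouts)
    if DO_TIME_SHARING:
        hooks.after_training()
        hooks.before_weight_sync()
    sync_weights()
    if DO_TIME_SHARING:
        hooks.end_progress_batch()
```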

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
taoluo added a commit that referenced this pull request May 2, 2026
…Gate 4

Three rounds of audit against rlix/scheduler/scheduler.py + RLix planner +
existing MILES code surfaced 15 distinct P0 issues in the unified port
plan that would block M11.2 Gate 4 (dual-pipeline) implementation.

Bucket I — Init control flow vs RLix scheduler reality
  #1+#2  Reorder init: scheduler executes coordinator.resize_infer(add)
         RPC in Phase 5 before signaling pending GEN waiter (planner
         emits SchedGuidedAllocationOp even on first-pipeline alloc).
         All-shell RolloutManager + register_model_update_resources +
         publish_cache_ready_step(-1) + bootstrap_active_engines(empty)
         must run as Step 6.6a-f BEFORE Step 7's _request_cluster_gpus.
         Add Step 7.1-7.3 post-request validation. Empty MILES router
         started in all-shell ctor; activate_routing is the sole
         add_worker entry point; per-pipeline router port (no
         inheritance from peer). Coordinator gains
         _active_engines_bootstrapped: bool to distinguish an empty-set
         bootstrap from "not yet bootstrapped" (sketched below).
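A toy sketch of the tri-state this flag creates; only bootstrap_active_engines, resize_infer, and the _active_engines_bootstrapped name come from the commit, the rest is illustrative:

```python
class Coordinator:
    def __init__(self) -> None:
        self._active_engines: set[int] = set()
        # Distinguishes "bootstrapped with an empty set" (valid in the
        # all-shell init path) from "bootstrap never ran".
        self._active_engines_bootstrapped: bool = False

    def bootstrap_active_engines(self, engines: set[int]) -> None:
        self._active_engines = set(engines)
        self._active_engines_bootstrapped = True

    def resize_infer(self, add: set[int]) -> None:
        if not self._active_engines_bootstrapped:
            raise RuntimeError("resize_infer called before bootstrap")
        self._active_engines |= add
```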

Bucket II — F4/F5+6 weight-refresh internal contradictions
  #3  Drop per-bucket weight_version from sender invocation, wrapper
      payload, HTTP route metadata, and receiver API. Per-sync version
      publish is the atomic unit's step (c) — once via
      manager.set_weight_version(version, target_engines).
  #4  HTTP /update_weights_from_cpu_bucket route handler must enter
      tokenizer_manager.model_update_lock.writer_lock (mirrors existing
      update_weights_from_tensor path); active-refresh's "engine keeps
      serving" semantics need this pause boundary to prevent half-new
      weights being observed mid-load (sketched after this bucket).
  #5  Add RolloutManager.get_engine_handles(engine_indices) read-only
      API; relax service-manager rule to "two read-only entries +
      one write" so service can dispatch per-engine RPCs.
  #6  Replace 5 split sender RPCs with composite
      cache_owner_actor.run_sync_session(plan: SyncSessionPlan).
      _cache_lock held in single method body, not cross-RPC.
      SyncSessionPlan is fully self-contained (target_handles +
      cpu_serialize_local_ranks + broadcast_local_ranks + comm_ranks),
      cache_owner never calls back to service or manager (sketched below).
  #7  Replace dist.new_group with TCP rendezvous via existing MILES
      init_process_group helper (broadcast.py:137 pattern) on sender
      side and SGLang init_custom_process_group on receivers. Cross-
      process default-PG-less topology cannot use dist.new_group.
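Hedged sketches of the Bucket II mechanics follow. First, the Fix #4 pause boundary: the route name and the model_update_lock.writer_lock pattern (mirroring SGLang's update_weights_from_tensor handler) are quoted from the commit; the handler body and the load helper are hypothetical:

```python
async def update_weights_from_cpu_bucket(tokenizer_manager, bucket):
    # The writer lock pauses request admission instead of letting
    # requests observe half-new weights mid-load; "engine keeps
    # serving" resumes the moment the lock is released.
    async with tokenizer_manager.model_update_lock.writer_lock:
        await tokenizer_manager.load_cpu_bucket(bucket)  # hypothetical helper
```

Fix #6's composite RPC: the dataclass fields and method name are quoted from the commit; the field types and the session body are assumptions:

```python
import threading
from dataclasses import dataclass
from typing import Any


@dataclass
class SyncSessionPlan:
    target_handles: list[Any]            # engine handles receiving weights
    cpu_serialize_local_ranks: list[int]
    broadcast_local_ranks: list[int]
    comm_ranks: list[int]


class CacheOwnerActor:
    def __init__(self) -> None:
        self._cache_lock = threading.Lock()

    def run_sync_session(self, plan: SyncSessionPlan) -> None:
        # One RPC, one lock scope: _cache_lock is held for the whole
        # session body instead of stretching across five RPCs, and the
        # self-contained plan means no callbacks to service or manager.
        with self._cache_lock:
            for rank in plan.cpu_serialize_local_ranks:
                pass  # serialize shards for CPU transport (placeholder)
            for rank in plan.broadcast_local_ranks:
                pass  # stage NCCL broadcast (placeholder)
```

Fix #7's sender side, as a minimal stand-in using the public torch.distributed API and assuming the sender has no default process group yet; MILES' actual helper (the broadcast.py:137 pattern) builds the group over the same TCP rendezvous, paired with init_custom_process_group on the SGLang receivers:

```python
import datetime
import torch.distributed as dist


def init_sender_group(host: str, port: int, rank: int, world_size: int) -> None:
    # dist.new_group is off the table: it needs an existing default PG
    # spanning all members, which these separate processes don't share.
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{host}:{port}",
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(minutes=5),
    )
```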

Bucket III — Spec drift / API contradictions
  #8  F12 main body switched from get_rollout_workers() / worker_placements
      to get_all_rollout_engine_placements() (declared full table) +
      get_active_engine_indices(allocated_gpus, tp_size) (active subset
      with half-engine raise; sketched after this bucket).
  #10 Gate 3 invariant (c) restated as initial_completed=0 contract,
      replacing stale "buffer ready count non-zero" wording that
      contradicts Fix #1.
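A sketch of the Fix #8 active-subset helper; the name, the arguments, and the half-engine raise are from the commit, the body is assumed:

```python
def get_active_engine_indices(allocated_gpus: int, tp_size: int) -> list[int]:
    if allocated_gpus % tp_size != 0:
        # The "half-engine raise": a partial engine's worth of GPUs is
        # treated as a bug rather than silently rounded down.
        raise ValueError(
            f"allocated_gpus={allocated_gpus} is not a multiple of "
            f"tp_size={tp_size}"
        )
    return list(range(allocated_gpus // tp_size))
```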

Bucket IV — TrainRayActor + milestone/Gate scope
  #9  TrainRayActor.__init__ accepts an optional local_rank kwarg; the
      RLix path passes local_rank=0 (1 actor per GPU under fractional
      allocation + manual CVD; ray.get_gpu_ids() values absent from the
      manual CVD list make get_local_gpu_id() raise ValueError). The
      standalone path keeps the existing fallback (see the sketch after
      this bucket). New file-change row for miles/ray/train_actor.py.
  #11 M11.2 unified-plan scope = Gate 4 happy path only (shell partial
      allocation + RolloutManager lazy ctor + Gate 4 (c)/(d)/(e)
      acceptance). admission_epoch / orchestrator cleanup / graceful
      drain stay M11.2-tagged follow-up — not Gate 4 pass/fail criteria.
      Manual `ray stop` is the accepted recovery on crash. Implementation
      order: Week 3 drops Gate 4; Week 4 picks up the M11.2 happy path.
  #12 Strengthened Gate 4 acceptance: (c) router worker list contains
      exactly active engine subset; (d) v=-1 short-circuit verified via
      both run_sync_session and finalize_weight_update call counters
      staying at pre-expand values; (e) donor-shrink-before-recipient-
      add ordering with T1 < T2 < T3 strict inequality.
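A sketch of the Fix #9 constructor change; the kwarg, the RLix-side value, and the failure mode are from the commit, while the fallback helper body is a placeholder:

```python
import ray


class TrainRayActor:
    def __init__(self, local_rank: int | None = None) -> None:
        if local_rank is not None:
            # RLix path passes local_rank=0: one actor per GPU under
            # fractional allocation with a manually set CVD, where
            # ray.get_gpu_ids() returns IDs absent from that list and
            # get_local_gpu_id() would raise ValueError.
            self.local_rank = local_rank
        else:
            # Standalone path keeps the existing derivation.
            self.local_rank = self._get_local_gpu_id()

    def _get_local_gpu_id(self) -> int:
        # Placeholder for MILES' existing get_local_gpu_id() fallback.
        return int(ray.get_gpu_ids()[0])
```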

Bucket V — Tail-end consequences of new init ordering
  #13 v=-1 service path skips run_sync_session, skips finalize fan-out,
      no cache_owner GPU touch. Required because Step 6 releases
      actor_train allocation before Step 7's GEN request — broadcast
      H2D staging on cache_owner would race the just-released GPU.
      Service runs only manager.set_weight_version(-1, target_engines);
      see the sketch after this bucket.
  #14 F10 RLix-mode startup fail-fast on resume args:
      args.load in {None, args.hf_checkpoint} AND
      args.ref_load in {None, args.hf_checkpoint}. Fix #13's no-transport
      assumes train and rollout base weights are equivalent; resume /
      non-equivalent ref_load would diverge them. CPU-only base sync
      for resume is M11.2 follow-up; for now fail-fast.
  #15 v=-1 path also skips engine.finalize_weight_update — SGLang's
      onload-time process_weights_after_loading already ran during
      manager.expand_engines lazy-spawn; re-running on already-loaded
      weights is redundant.
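Fixes #13-#15 compose into a short service-side path. In this sketch, only set_weight_version, run_sync_session, finalize_weight_update, get_engine_handles, and the arg-check conditions come from the commit; build_plan and the function names are invented:

```python
def check_resume_args(args) -> None:
    # Fix #14 fail-fast: the v=-1 no-transport path assumes train and
    # rollout base weights are equivalent, so resume args must not
    # point anywhere else.
    if not (args.load in {None, args.hf_checkpoint}
            and args.ref_load in {None, args.hf_checkpoint}):
        raise SystemExit("RLix mode: resume/ref_load would diverge from "
                         "the rollout base weights; aborting at startup")


def refresh_weights(service, manager, version, target_engines) -> None:
    if version == -1:
        # Fixes #13/#15: no run_sync_session (no cache_owner GPU touch,
        # which would race the GPUs Step 6 just released) and no
        # finalize fan-out (process_weights_after_loading already ran
        # during manager.expand_engines lazy-spawn).
        manager.set_weight_version(-1, target_engines)
        return
    plan = service.build_plan(version, target_engines)  # hypothetical
    service.cache_owner_actor.run_sync_session(plan)
    for engine in manager.get_engine_handles(target_engines):
        engine.finalize_weight_update(version)
    manager.set_weight_version(version, target_engines)
```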

Cross-cutting cleanups
  - F4 _cache_lock invariant clarified as in-method (run_sync_session
    body), not cross-RPC.
  - F5+6 atomic-unit description rewritten to spell out the (a)/(b)/(c)
    steps and the v=-1 short-circuit.
  - Receiver API table tightened: weight_version dropped from
    update_weights_from_cpu_bucket signature; finalize_weight_update
    row notes v=-1 skip.
  - SyncSessionPlan dataclass spec includes M11.2 narrowing note —
    Gate 4 happy-path acceptance runs cpu_serialize-only topology;
    NCCL broadcast complexity stays as M11.1 load-bearing structure
    (Gate 2.5 unchanged), M11.2 doesn't add transport work.
  - F12 placement provider rewrite + standalone path preserved.
  - Implementation follow-up table gains "M11.2 follow-up: CPU-only
    base sync at v=-1 for resume / non-equivalent ref_load" entry.

No code changes; spec-only document. 858 insertions / 235 deletions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TianyeGGBond merged commit 62c9426 into rlops:nemo on May 5, 2026